Modeling the Synchrony between Audio and Visual Modalities for Speaker Identification
Authors
Abstract
This work aims to understand and model the inter-modal temporal relations between the audio and visual modalities of speech, and to validate whether the captured relations can improve the performance of audio-visual bimodal modeling for applications such as audio-visual speaker identification. We propose to extend our audio-visual correlative model (AVCM) with explicit durational modeling of the partial temporal synchrony between the two speech modalities, i.e., where the audio may lead, lag, or remain synchronized with the video. We refer to the new extended model as Durational-AVCM. Experiments on the CMU database and a homegrown database demonstrate that Durational-AVCM improves the accuracy of audio-visual speaker identification over the original AVCM at all acoustic signal-to-noise ratios (SNRs) from 0 dB to 30 dB under varying acoustic conditions. The results indicate the importance of incorporating the partial temporal synchrony between the audio and visual modalities into audio-visual bimodal modeling.
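As a rough illustration of the idea only, and not the authors' Durational-AVCM (whose graphical structure and duration distributions are defined in the paper), the Python sketch below decodes a sequence of asynchrony states, audio leads, synchronized, or audio lags, from per-frame cross-modal affinities. A simple state-switching penalty stands in for explicit duration modeling, and the cosine affinity, feature dimensions, and offsets are illustrative assumptions.

```python
import numpy as np

# Toy decoder over three asynchrony states; a stand-in for the idea of
# partial audio-visual synchrony, not the Durational-AVCM itself.
OFFSETS = (-1, 0, 1)
LABELS = ("audio lags", "synchronized", "audio leads")

def frame_scores(audio, video, offsets=OFFSETS):
    """Affinity between audio frame t and video frame t - offset.

    audio, video: (T, D) feature matrices (e.g., acoustic and lip features
    projected to a common dimension D). Returns a (T, K) score matrix.
    """
    T = audio.shape[0]
    scores = np.full((T, len(offsets)), -np.inf)
    for k, off in enumerate(offsets):
        for t in range(T):
            tv = t - off
            if 0 <= tv < T:
                a, v = audio[t], video[tv]
                # cosine similarity as a crude cross-modal affinity
                scores[t, k] = a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-8)
    return scores

def decode_asynchrony(scores, switch_penalty=0.5):
    """Viterbi over asynchrony states; the switching penalty is a crude
    substitute for the explicit durational modeling described above."""
    T, K = scores.shape
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = scores[0]
    for t in range(1, T):
        for k in range(K):
            trans = dp[t - 1] - switch_penalty * (np.arange(K) != k)
            back[t, k] = np.argmax(trans)
            dp[t, k] = scores[t, k] + trans[back[t, k]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(50, 8))
    # synthetic audio that leads the video stream by one frame
    audio = np.roll(video, 1, axis=0) + 0.1 * rng.normal(size=(50, 8))
    states = decode_asynchrony(frame_scores(audio, video))
    print([LABELS[k] for k in states[:5]])
```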
Similar references
Audio-Visual Correlation Modeling for Speaker Identification and Synthesis
This thesis addresses two major problems of multimodal signal processing using audiovisual correlation modeling: speaker recognition and speaker synthesis. We address the first problem, i.e., the audiovisual speaker recognition problem within an open-set identification framework, where audio (speech) and lip texture (intensity) modalities are fused employing a combination of early and late inte...
Detecting audio-visual synchrony using deep neural networks
In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investig...
Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities
An audio-visual speaker identification system is described, where the audio and visual speech modalities are fused by an automatic unsupervised process that adapts to local classifier performance, by taking into account the output score based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The v...
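A minimal sketch of this kind of reliability-weighted score fusion, assuming the reliability of each modality is estimated from the entropy of its class posterior; this is only one of several output-score-based measures, and the cited paper's exact estimate and adaptation rule may differ.

```python
import numpy as np

# Hedged sketch of adaptive score-level fusion: each modality is weighted
# by a reliability estimate derived from its own classifier scores.

def reliability(scores):
    """Reliability from the posterior entropy of one modality's scores.

    scores: (N_classes,) raw scores. A more peaked posterior gives lower
    entropy and hence higher reliability, in (0, 1].
    """
    p = np.exp(scores - scores.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - entropy / np.log(len(p))

def fuse(audio_scores, video_scores):
    """Weight each modality by its normalized reliability before summing."""
    ra, rv = reliability(audio_scores), reliability(video_scores)
    wa = ra / (ra + rv + 1e-12)
    return wa * audio_scores + (1.0 - wa) * video_scores

if __name__ == "__main__":
    audio = np.array([2.0, 0.1, 0.0, -0.5])   # confident audio classifier
    video = np.array([0.4, 0.5, 0.45, 0.42])  # degraded, nearly flat scores
    print(np.argmax(fuse(audio, video)))      # fusion leans on audio here
```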
Robust audio-visual speech synchrony detection by generalized bimodal linear prediction
We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem holds a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed ti...
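Loosely following the linear-prediction view above, the sketch below scores synchrony by how much adding the video features reduces the residual of a least-squares predictor of the audio stream. The generalization and time-shift handling of the actual method are omitted, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

# Simplified, illustration-only synchrony measure: compare audio-only and
# bimodal linear prediction residuals. Not a reimplementation of the
# generalized bimodal linear prediction in the cited paper.

def prediction_error(audio, video=None, order=3):
    """Mean squared residual of predicting audio[t] from audio[t-order:t]
    (and, optionally, the co-occurring video frame video[t])."""
    T, _ = audio.shape
    rows, targets = [], []
    for t in range(order, T):
        ctx = audio[t - order:t].ravel()
        if video is not None:
            ctx = np.concatenate([ctx, video[t]])
        rows.append(ctx)
        targets.append(audio[t])
    X, Y = np.asarray(rows), np.asarray(targets)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.mean((X @ W - Y) ** 2))

def synchrony_score(audio, video, order=3):
    """Relative error reduction when video is added to the predictor:
    near zero for unrelated streams, larger when the streams are coupled."""
    e_audio = prediction_error(audio, order=order)
    e_bimodal = prediction_error(audio, video, order=order)
    return 1.0 - e_bimodal / (e_audio + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    video = rng.normal(size=(200, 4))
    coupled = video @ rng.normal(size=(4, 6)) + 0.3 * rng.normal(size=(200, 6))
    unrelated = rng.normal(size=(200, 6))
    print(synchrony_score(coupled, video), synchrony_score(unrelated, video))
```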
Audio-visual synchronisation for speaker diarisation
The role of audio-visual speech synchrony for speaker diarisation is investigated in the multiparty meeting domain. We measured both mutual information and canonical correlation on different sets of audio and video features. As acoustic features we considered energy and MFCCs. As visual features we experimented both with motion intensity features, computed on the whole image, and Kanade Lucas T...
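The canonical-correlation part of such a measurement can be sketched as below, assuming the MFCC and visual motion features have already been extracted elsewhere (the random arrays here only stand in for them); scikit-learn's CCA is used for the projection, which is an implementation choice, not necessarily the one used in the cited work.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Sketch of measuring audio-visual canonical correlation on pre-extracted,
# frame-synchronized features; the feature extraction itself is assumed.

def canonical_correlation(audio_feats, video_feats, n_components=1):
    """First canonical correlation between two synchronized feature streams.

    audio_feats: (T, Da), e.g. MFCCs per frame.
    video_feats: (T, Dv), e.g. motion-intensity or tracked-point features.
    """
    cca = CCA(n_components=n_components)
    cca.fit(audio_feats, video_feats)
    a_c, v_c = cca.transform(audio_feats, video_feats)
    return float(np.corrcoef(a_c[:, 0], v_c[:, 0])[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    T = 300
    video = rng.normal(size=(T, 5))                 # stand-in motion features
    audio = np.hstack([video[:, :2] @ rng.normal(size=(2, 3)),
                       rng.normal(size=(T, 10))])   # partly driven by video
    audio += 0.5 * rng.normal(size=audio.shape)
    print(canonical_correlation(audio, video))      # noticeably above chance
```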